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WO 98/14619 PCT/US97/18344 

METHODS FOR GENERATING AND ANALYZIN G TRANSCRIPT MARKERS 

TECHNICAL FIELD 
The present invention relates generally to the field of molecular biology and specifically 
to rapid, high-throughput gene discovery methods that facilitate genome closure and to methods 
5 for analyzing gene expression patterns. 

PACKGROUNDAR T 
Many genes have been isolated and their structures determined since the introduction of 
recombinant cDNA technology. It has been predicted that through the efforts of the Human 
Genome Project, sequencing of the entire human genome (genome closure) will be accomplished 
10 sometime between 2001 and 2005 (Boguski, 1995, Molecular Medicine 333:645-647). 
Strategies for achieving genome closure include methods for constructing 3' directional 
normalized libraries which equalize cDNA representation (Soares et al. 1994, Proc. Natl. Acad. 
£ci. 91:9928 and Patanjali et al, 1991- Proc. Natl Acad. Sci . USA 88:1943-1947) and methods 
for creating cDN A libraries based on subtractive hybridization techniques which provide methods 
15 for isolating the coding sequences of genes which are differentially expressed, such as during 
development or in disease states which are described in Rubenstein et al (1990, Nucleic Acids 
Research . 18:4833-4842); Travis et al. (1988, P ros, Nat|, Acfld, Sci . USA, 85:1696-1700). 

Elucidation of gene expression represents another level of complexity equally important 
to the elucidation of genetic structure. The generation of a gene expression pattern can be used 
20 directly as a diagnostic profile or as a gene discovery method. Seilhamer et al. (WO 95/2068 1 , 
filed January 27, 1995) disclose methods for the high-throughput sequence-specific analysis of 
cDNAs and generation of transcript images. Matsubara et al (WO 95/14772, filed November 
1 1,1994) disclose methods for generating 3' directed cDNA libraries which accurately reflect the 
abundance ratio of mRNA in a cell. Velculescu et al. (1995, Science 270:484-487) describe a 
25 method for the analysis of gene "tags" or transcripts which uses type lis restriction enzymes and 
involves the generation and analysis of short, 3' nucleotide sequences which may inherently 
contain a substantial amount of 3' non-coding information. Kato (1995, Nucleiq Acids Research 
23:3685-3690) describes a method for the identification of 3' end cDNA fragments which 
involves the use of type lis restriction enzymes and PCR methodology. Kato notes that there are 
30 several technical limitations to the method including the presence of PCR generated artifacts and 
the fact that cDNA sequences lacking enzyme recognition sites will not be displayed. 

Aoto et al. (1995, Eur. J. Biochem. 234:8-15) describe isolation of a cDNA clone 
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obtained from a cDNA library of mRNA prepared from cells treated with 5-azacytidine. The 
deoxycytidine analog, 5-aza-2'-deoxycytidine (5-azadCyd), is highly toxic in cultured cells and 
animals and has been used clinically as an anti-cancer agent. 

5-azadCyd has been used experimentally as a DNA methylation inhibitor to induce gene 
5 expression and cellular differentiation (Juttermann et al. 1994, Proc. Natl. Acad. Sci. USA 91 : 
11797-11801). 

In spite of the availability of methods designed to facilitate gene discovery and elucidate 
gene expression, there remains a need in the art for methods which will expedite the process of 
gene discovery in a rapid, high-throughput manner, thereby contributing to the process of gene 

10 closure. There also remains a need in the art for methods for analyzing gene expression patterns 
in a rapid, high throughput manner. 

DISCLOSURE OF THE INVENTION 
The present invention relates, in part, to rapid, high-throughput gene discovery methods 
that facilitate genome closure and to methods for analyzing gene expression patterns. The 

15 present invention also relates to methods for the rapid, sequence-specific identification of 

transcripts derived from an mRNA population. The present invention further relates to methods 
for extending the nucleotide sequences of partial transcripts in a high-throughput manner using 
polymerase chain reaction technology. 

The present invention provides rapid, high-throughput methods for generating and 

20 analyzing transcript markers from the 5 f most end of cDNAs and methods for generating and 
analyzing two, discontinuous transcript markers from a single cDNA providing the advantage of 
obtaining more information from a single transcript than is possible by current methods. In one 
aspect of the present invention, the two discontinuous markers are derived from both the 5 1 and 3' 
ends of a single cDNA and in another embodiment of the present invention, the two 

25 discontinuous markers are derived from random areas of the cDNA. 

In one aspect of the present invention, a method is provided for generating and analyzing 
transcript markers from the 5' most end of individual cDNAs of a cDNA library comprising the 
steps of obtaining a cDNA library comprising individual cDNAs having a first restriction 
endonuclease site for a restriction endonuclease that digests the cDNA at the 5 f most end within 

30 an expected distance from its recognition site and a second endonuclease restriction site; 

subjecting the cDNAs to digestion with the first and the second restriction endonuclease thereby 
excising transcript markers from the 5' most end of the individual cDNAs; ligating said transcript 
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markers to a vector; transforming said vector containing the transcript markers in a host cell; 
culturing said host cells; and performing nucleic acid sequence analysis of the transcript markers. 
This method can be used for gene discovery purposes or transcript imaging purposes. This 
method can be combined with PCR based technology for the rapid nucleic acid sequence analysis 

5 of the transcript markers. 

In another aspect of the present invention, a method is provided for generating and 
analyzing non-contiguous transcript markers derived from the 5 1 most and 3' most end of 
individual cDNAs, comprising the steps of obtaining a cDNA library comprising individual 
cDNAs having a first restriction endonuclease site at the 5' most end and at the 3 1 most end for a 

10 restriction endonuclease that digests nucleic acid within an expected distance from the first 
endonuclease recognition site, and a second endonuclease restriction site; subjecting the cDNAs 
to digestion with the first endonuclease thereby creating linearized cDNAs containing transcript 
markers from the 5* most end and the 3* most end of the individual cDNAs; self-ligating said 
linearized cDNAs thereby joining the transcript markers from the 5' most and 3' most ends to 

15 create cDNAs containing non-contiguous transcript markers; transforming said linearized cDNAs 
in a host cell and culturing said host cells; isolating said cDNA from said host cells; digesting the 
cDNA with the second restriction endonuclease thereby excising the non-contiguous transcript 
markers; ligating said excised transcript markers to a second vector; transforming said second 
vector containing the transcript markers in a host cell and culturing said host cells; and 

20 performing nucleic acid sequence analysis of the transcript markers. 

In another aspect of the present invention, a method is provided for the rapid, sequence- 
specific identification of cDNAs derived from a human mRNA population, comprising the steps 
of obtaining a cDNA library comprising individual cDNAs wherein said cDNAs contain first 
restriction endonuclease sites for an endonuclease having a 4 base pair recognition site and 

25 wherein said cDNAs are cloned into a first vector lacking the first restriction endonuclease sites; 
subjecting the cDNA library to digestion with the first restriction endonuclease thereby creating 
linearized cDNAs containing a portion of the original cDNA; ligating an adapter to said 
linearized cDNAs wherein the adapter contains a second restriction endonuclease site for an 
endonuclease that cleaves within an expected range of its recognition site thereby creating 

30 cDNAs containing two non-contiguous transcript markers joined by the adapter; digesting the 
cDNAs with the second restriction endonuclease thereby excising the transcript markers from 
said cDNAs; concatenating said excised transcript markers; ligating said concatenated transcript 
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markers to a second vector; transforming said second vector containing the transcript markers in 
a host cell and culturing said host cells; and performing nucleic acid sequence analysis on the 
transcript markers in said host cells. 

The present invention also provides novel methods and vectors used in preparation of 
5 cDNA libraries containing transcript markers. The present invention also provides novel, high- 
throughput methods for obtaining complete nucleotide sequence information on transcript 
markers. In one embodiment of the present invention, cDNA libraries containing a serial 
arrangement of multiple transcript markers are constructed and subjected to high-throughput 
nucleic acid sequence analysis in a multi-well format using polymerase chain reaction (PCR) 
1 0 technology and specialized PCR primers. 

In additional aspects of the present invention, the cDNA libraries are constructed to bias 
the cDNA population toward rare cDNAs. In one aspect of the present invention, a cDNA library 
is constructed by normalization techniques and in another, the cDNA library is constructed using 
subtractive hybridization techniques. In yet another aspect of the present invention, the cDNA 
15 library has been constructed from mRNA treated with demethylating agents, such as, 5-aza-2' 
deoxycytidine and 5-azacytidine, which induces the transcription of silent genes. In an additional 
aspect of the present invention, the cDNA library has been constructed using oligo dT primers 
and in another aspect, the cDNA library is constructed using random primers. 

BRIEF DESCRIPTION OF DRAWINGS 
20 Figure 1 illustrates a general schematic of cDNA library construction. 

Figure 2 illustrates Array of Transcript Markers (ATM) Strategy 1 for constructing a 
cDNA library containing an array of transcript markers derived from the 5' most end of a cDNA 
(5* transcript markers). 

Figure 3 illustrates ATM Strategy II for constructing a cDNA library containing an array 

25 of 5' transcript markers. 

Figures 4A-4B illustrate ATM Strategy III for constructing a cDNA library containing an 
array of 5' transcript markers. 

Figures 5A-5B illustrate ATM Strategy IV for constructing a cDNA library containing an 
array of 5' transcript markers. 
30 Figures 6A-6B illustrate ATM Strategy V for constructing a cDNA library containing an 

array of 5' transcript markers. 

Figure 7 illustrates a strategy for simultaneously obtaining two non-contiguous transcript 
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markers from both the 5' and 3' end of cDNA. 

Figure 8 illustrates a strategy for obtaining non-contiguous transcript markers from 
random areas of a cDNA. 

Figure 9 illustrates a sequence specific approach to the identification of all transcribed 
5 genes from an mRNA population. 

Figure 10 is a list of 12 known sequences identified by the method illustrated in Figure 2. 

Figure 1 1 illustrates the toxicity curve for THP1 cells treated for 3 days with 5-aza-2'- 
cytidine. 

Figures 12A-12B illustrate examples of Type lis restriction endonucleases useful for 
10 generating ATM libraries. Bpml, Bsgl and Eco57I are three restriction endonuclease that cleave 
16 bp away from their recognition site with a 2 nucleotide 3' overhang. Bpml is illustrated in 
Figure 12 A. N's represent nucleotides found adjacent to the restriction site (e.g., the 5' end of a 
cDNA). Treatment with T4 DNA polymerase removes the 3' extension which leaves 14 bp 
derived from the adjacent sequence. Figure 12B illustrates that the restriction endonuclease 
15 BSMFI cleaves 14 bp away from its recognition site with a 4 nucleotide 5' overhang. This end 
may be filled in by treatment with a DNA polymerase which results in 14 bp derived from 
adjacent sequence. 

Figure 13 illustrates a vector useful in the strategy of Figure 8 which has all restriction 
endonuclease sites having a 4 base pair recognition site removed. 
20 Figures 14A-14B illustrate a typical concatenated array of transcript markers. 

Figure 15 illustrates an abundance profile of THP cells treated with S-aza-2' 
deoxycytidine. The black-shaded area represents control clones, the gray shaded area represents 
S-aza-2 1 deoxycytidine treated cells and the gray line represents the abundance profile of the 5- 
aza-2 1 deoxycytidine treated cells. 
25 MODES FOR CARRYING OUT THE INVENTION 

Before the present compounds, variants, formulations and methods for making and using 
such are described, it is to be understood that this invention is not limited to the particular 
compounds, variants, formulations or methods described, as such variants, formulations and 
methodologies may, of course, vary. It is to be understood that the terminology used herein is for 
30 the purpose of describing particular embodiments only, and it is not intended to be limiting since 
the scope of the present invention will be limited only by the appended claims. 

It must be noted that as used in the specification and the appended claims, the singular 
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forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise 

which will be known to those skilled in the art or will become known to them upon reading this 

specification. 

L Definitions 

5 As used herein, the term "transcript marker derived from a cDNA library" refers to an 

isolated polynucleotide derived from an individual cDNA and being preferably from about 10 
base pairs to about 20 base pairs in length derived from an individual cDNA. 

As used herein, the term "non-contiguous transcript marker from an individual cDNA" 
refers to two polynucleotides which are not adjacent to one another under naturally occurring 

10 conditions, but which are constructed to exist in tandem, with each polypeptide being preferably 
from about 10 base pairs to about 20 base pairs in length. 

As used herein, the term "3' transcript marker" or "transcript marker from the 3* most end 
of a cDNA" refers to an isolated polynucleotide derived from the 3' most end of an individual 
cDNA and being preferably from about 10 base pairs to about 20 base pairs in length. 

15 As used herein the term "5" transcript marker" or "5' transcript marker derived from a 

cDNA library" refers to an isolated 5' most nucleic acid sequence of a cDNA which is preferably 
about 10 base pairs to about 20 base pairs in length. As used herein, "5 ! most" means that a 5' 
transcript marker may represent the 5' end of the full-length coding region of a cDNA and may 
include 5' untranslated sequences. Alternatively, a 5' most transcript marker may represent an 

20 internal coding region of a individual cDNA. Each 5' transcript marker can reflect the expression 
of an individual cDNA. 

As used herein, the term "adapter" refers to a synthetic fragment of nucleic acid which is 
ligated to a cDNA and which may contain recognition sites for restriction endonucleases. 

As used herein the term "5* adapter" refers to a synthetic fragment of nucleic acid which 

25 is ligated onto the 5' end of cDNAs prior to ligation to a vector. In the present invention, the 5* 
adapter contains a first restriction endonuclease site which digests nucleic acid at a expected 
distance from its enzymatic recognition site and at least one other restriction endonuclease site. 
The second restriction endonuclease site is for a restriction endonuclease which digests the cDNA 
to form blunt ends or 5' or 3' overhangs. In a preferred embodiment of the present invention, 

30 digestion of the cDNA with the first and second restriction enzymes excises transcript markers 
which are at least 20 base pairs in length and contain at least 14 base pairs of cDNA sequence and 
6 base pairs of adapter sequence. 
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As used herein the term "array(s) of transcript markers" or "ATM" refers to the collection 
or serial arrangement of transcript markers prepared by concatenation of individual transcript 
markers. As used herein, an "ATM cDNA library" is a cDNA library containing transcript 
markers as inserts. The inserts may be individual inserts, multiple inserts or a serial arrangement 
5 of multiple inserts which may be in sense or antisense orientation. 

As used herein the term "full-length" coding region refers to the cDNA sequence for the 
entire transcribed mRNA for a particular protein from initiating methionine to the poly A tail. 

As used herein the term "type lis restriction endonuclease" or t4 type lis enzyme " refers to 
that category of restriction endonucleases that digests nucleic acid at an expected distance from 
10 the enzymatic recognition site. Preferred examples of type lis enzymes for use in the present 
invention are those which digest nucleic acid at least 10 base pairs from the recognition site or 
which digest nucleic acid less than 10 base pairs from the recognition site but can be filled in 
enzymatically to give a 10 base pair transcript marker. Examples of such Type lis enzymes 
include, but are not limited to Bpml, Bsgl Eco57I and BsmFI. Examples of type lis restriction 
15 endonucleases is illustrated in Figures 12A-12B. As will be understood by those of skill in the 
art, it is possible to alter enzymatic digestion conditions to change the nucleic acid digestion 
position of the type lis restriction endonuclease. 

As used herein the term "type Ilsg restriction endonuclease" or "type Hsg enzyme" refers 
to that category of restriction endonucleases that digests nucleic acid within an expected range of 
20 its enzymatic recognition site. Examples of type Ilsg restriction enzymes include, but are not 
limited to Bcgl. 

As used herein the term "concatenating" refers to the process of ligating multiple 
transcript markers prior to ligation in a cloning vector. 

As used herein the term "normalized cDNA library" or "normalized library" refers to a 
25 cDNA library constructed in such a manner as to reduce the redundancy in high- level abundance 
cDNAs. 

As used herein the term "subtractive hybridization" (when referring to construction of a 
cDNA library) refers to a process wherein a first population of nucleic acid is hybridized with a 
second labelled population of nucleic acid (driver) and the resultant nucleic acid hybrids removed 
30 to completion thereby identifying and isolating a set of nucleic acid sequences unique to the first 
population of nucleic acid which may be used in the construction of a cDNA library. 

As used herein the term "selected set" of random primers refers to a set of primers 
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designed to anneal to a specific region of mRNA. In a preferred embodiment herein, the selected 
set of random primers is designed to anneal to the 5' most region of mRNA used as starting 
material for cDNA synthesis. 

As used herein the term "transcript imaging" refers to a method of determining the 
5 relative abundance of individual transcript markers in a cDNA library. The term relative 
abundance refers to the number of times an individual transcript marker appears relative to the 
total number of transcript markers identified. 

As used herein the term "non-amplified growth" of host cells refers to growth conditions 
which allow for uniform growth of recombinant host cells. 
10 As used herein, the term "high abundance" or "high-level abundance" messages exist at 

greater than 10,000 copies per cell. 

As used herein, the term "mid abundance" or "mid-level abundance" species exist at 100 
to 400 copies per cell. 

As used herein, the term "low abundance" or "low-level abundance" species are found at 
15 less than 15 copies per cell. This low-abundance class represents 20 to 50 percent of the unique 
transcripts in the cell. 

As used herein the term "wobble primer" refers to a sequencing primer degenerate at the 
3'-end to allow sequencing for all three possible bases following the poly A tail. 

The present invention relates to rapid, high-throughput gene discovery methods that 
20 facilitate genome closure and to methods for analyzing gene expression patterns. The present 
invention is based, in part, upon the discovery of methods for the generation of transcript markers 
derived from the 5' end of cDNAs contained within a cDNA library. The present invention is 
also based, in part, upon the discovery of methods for the generation of two discontinuous 
transcript markers derived from both the 5' and 3' ends of cDNAs contained within a cDNA 
25 library. 

The transcript markers of the present invention are derived from previously constructed 
cDNA libraries. The nucleotide information contained within transcript markers can be used to 
identify novel transcripts or to provide the basis for the design of PCR primers useful in 
extending and identifying a transcript marker nucleotide sequence contained within a cDN A 
30 library. CDNA libraries of the present invention can be constructed by methods which equalize 
the population of cDNAs or bias the population of cDNAs toward rare cDNAs, e.g. by 
normalization techniques, subtraction techniques, treatment with demethylating agents and 
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treatment with differentiating agents, for example. 

The present invention also provides novel methods for achieving genome closure which 

combine rapid nucleic acid sequence identification of transcript markers from a cDNA with 

subsequent extension of the sequence of the identified transcript markers using PCR technology. 
5 Methods of the present invention may also be used to provide a transcript image of specific 

tissues or biological samples. 

IL Construction of cPNA libraries 

CDNA libraries for use in the methods of the present invention may be prepared by 

methods described in Maniatis et al. (1982, Molecular Cloning: a Laboratory Manual. Cold 
10 Spring Harbor Laboratory, Cold Spring Harbor, New York) or any means known to those of skill 

in the art. 

CDNA libraries may be constructed to bias the population of individual cDNAs toward 
desired coding regions. A cDNA library constructed with oligo dT primers would be expected to 
contain nucleic acid sequences from the 3' coding region of a cDNA, as well as 3' untranslated 

15 regions. Additionally, a cDNA library constructed with oligo dT primers may also contain part 
or all of the entire coding region of a particular cDNA. A cDNA library constructed with random 
primers would be expected to contain nucleic acid sequences from all parts of the coding region 
of a cDNA, including the 5* most nucleic acid sequence of the full-length coding region. 

CDNA libraries constructed with selected sets of random primers, such as primers which 

20 specifically prime first strand cDNA synthesis toward the 5' end, such as with the incorporation 
of CapFinder™ PCR construction kit (Clontech K1051-1), are desirable for obtaining 5' 
transcript markers from the 5' most end of a putative transcript or cDNA. 

For gene discovery purposes, preferred methods for the construction of cDNA libraries 
include the use of random primers or a selected set of random primers designed to prime cDNA 

25 synthesis from the 5' end of the mRNA; the use of cell lines treated with demethylating 

compounds, such as, 5-aza-2-deoxycytidine, as starting material for the preparation of mRNA; 
the use of cell lines treated with compounds which induce differentiation, such as retinoic acid; 
normalization or equalization methods; the use of subtractive hybridization techniques in the 
construction of cDNA libraries; and any method that induces the transcription of silent genes or 

30 allows for the identification of rare cDNAs, such as size fractionation of specific cDNA 
populations expected to contain rare cDNAs. 

Preferred methods for the construction of cDNA libraries intended for transcript imaging 
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purposes are those which produce an unbiased population of cDNAs and would include a step for 
the non-amplified or uniform growth of host cells used in constructing the cDNA library. Such 
growth conditions are described i n Current Pro tocols in Molecular Biology "Amplification of 
Cosmid and Plasmid Libraries" Unit 5.10. (1987). 
5 In the normalization process, the prevalence of high-abundance cDN A clones decreases 

dramatically, clones with mid-level abundance are relatively unaffected, and clones for rare 
transcripts are effectively increased in abundance. In the Soares et al. (1994, supra) 
normalization procedure, the abundance levels of individual cDNA clones have been equalized 
by a kinetic re-annealing hybridization. This approach is designed to reduce the initial 10,000- 
10 fold variation in individual cDNA frequencies in order to achieve abundances within one order of 
magnitude while maintaining the overall sequence complexity of the library. 

CDNA libraries prepared by normalization techniques are not an accurate reflection of the 
source tissue's gene-expression profile; however cDNA libraries produced by normalization 
techniques may provide a source of low abundance transcripts and therefore, would be useful for 
15 gene discovery purposes. 

A variety of normalization techniques are known by those of skill in the art for the 
production of normalized cDNA libraries. Weissmann S.M. (1987 Mol. Bio. Med. 4:133-143) 
describe a method based on hybridization to genomic DNA wherein the frequency of each 
hybridized cDNA in the resulting normalized library would be proportional to that of each 
20 corresponding gene in the genomic cDNA. Ko. (1990, Nucleic Acid Res. 18:5705-5711) and 
Patanjali et al (1991, Proc. Natl. Acad. Sci. 88:1943-1947) describe a kinetic approach to 
constructing cDNA libraries. Soares (WO 95/08647, filed September 23, 1994 and published 
March 30, 1995) describe a method to normalize a 3' directional cDNA library. 

Subtractive hybridization of nucleic acids is a method to isolate the coding sequences of a 
25 gene which are differentially expressed such as during development or in disease states. CDNA 
libraries prepared by subtractive hybridization techniques are described in Rubenstein et al (1990, 
N.ir.teir. AHHs Research . 18:4833-4842); Travis et al. (1988, Proc. Natl. Acad. Sci. USA, 
85:1696-1700). 

Obtaining cDNA libraries from cells treated with 5-aza-2'-deoxycytidine should enhance 
30 the discovery of rare genes arid genes expressed in specialized cell types from which it is difficult 
to isolate and/or prepare DNA. A non-toxic concentration of 5-aza-2*-deoxycytidine can be pre- 
determined through titration/toxicity assays as illustrated in Figure 11. 5-aza-2'-deoxycytidine 
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has been shown to induce the transcription of silent genes through demethylation (Hsieh et al 
supra). The preferred amount of 5-aza-2'-deoxycytidine to use is that amount that induces the 
transcription of silent genes without being toxic to the cell. This method can be coupled to 
subtractive methods to enhance further the discovery of novel transcripts or genes. 

5 IL Construction o f ATM Libraries 

ATM libraries containing transcript markers can be constructed using any of the methods 
described in Figures 2-8. Figure 2 illustrates ATM Strategy 1 for constructing a cDNA library 
containing an array of transcript markers derived from the 5* most end of a cDNA (5* transcript 
markers). In Strategy 1, a cDNA library is constructed using an adapter containing a Bpml site 

10 and a PvuII restriction site. The constructed cDNA library is digested with Bpml and PvuII to 
isolate 20 bp 5' transcript markers (14 base pairs of cDNA and 6 base pairs from the adapter) 
from the cDNA library, the isolated markers are treated with T4 DNA polymerase to yield blunt 
ends and the markers are concatenated to form an array or serial arrangement of multiple 
markers. The concatenated array of 5' transcript markers is then used to create a library 

15 containing an array of 5 1 transcript markers. In a variation of this method, individual transcript 
markers can be subjected to subtractive hybridization methods to bias the population toward rare 
transcript markers. 

Another strategy is shown in Figure 3 which illustrates ATM Strategy II for constructing 
a cDNA library containing an array of 5* transcript markers. In Strategy II, a cDNA library is 

20 constructed using an adapter containing a Bpml site and a PvuII restriction site. The constructed 
cDNA library is digested with Bpml and a second degenerate adapter containing a PvuII site is 
ligated onto the Bpml site. After PCR amplification of the template and PvuII digestion, 25 bp 
transcript markers are isolated. The isolated transcript markers are concatenated to form arrays, 
ligated into a vector and transformed into host cells to create a library containing an array of 5* 

25 transcript markers. 

ATM Strategy III, as shown in Figures 4A-4B illustrates the construction of a cDNA 
library containing an array of 5' transcript markers. In Strategy III, a cDNA library is constructed 
using an adapter containing a Bpml site and a PvuII restriction site. The constructed cDNA 
library is digested with Bpml and a second degenerate adapter containing a PvuII site is ligated 

30 onto the Bpml site. Sense RNA is transcribed from the template and the mixture is subjected to 
DNAse treatment. First and second strand cDNA is synthesized from the template. The double 
stranded cDNA is digested with PvuII which yields 25 bp transcript markers. The transcript 
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markers are concatenated to form arrays, the arrays are cloned into a vector and the vector 
transformed into host cells to create a library containing an array of 5* transcript markers. 

ATM strategy IV, as shown in Figures 5 A-5B, illustrates ATM Strategy IV for 
constructing a cDNA library containing an array of 5' transcript markers. In strategy IV, a cDNA 
5 library is constructed using an adapter containing a Bpml and PvuII restriction site. The 

constructed cDNA library is digested with Bpml and PvuII to isolate 20 bp 5' transcript markers 
from the cDNA library, the isolated markers are treated with T4 DNA polymerase to yield blunt 
ends and the markers are concatenated to form an array or serial arrangement of multiple 
markers. Adapters are added on to the ends of the transcript marker arrays and the arrays are 
10 amplified using PCR technology. The arrays are cloned into a vector to create a library 
containing arrays of the transcript markers (ATM). 

Another strategy is shown in Figures 6A-6B which illustrate ATM Strategy V for 
constructing a cDNA library containing an array of 5' transcript markers. In strategy V, a cDNA 
library is constructed using an adapter containing a Bpml and PvuII restriction site. The 
15 constructed cDNA library is digested with Bpml and PvuII to isolate 20 bp 5' transcript markers 
from the cDNA library, the isolated markers are treated with T4 DNA polymerase to yield blunt 
ends and the markers are concatenated to form an array or serial arrangement of multiple 
markers. At this point, the arrays are ligated into a piasmid vector, subjected to PCR to amplify 
the transcript markers, and cloned into a vector to create a cDNA library containing PCR 
20 amplified arrays of 5' transcript markers. 

Figure 7 illustrates a strategy for simultaneously obtaining two non-contiguous transcript 
markers from both the 5' and 3' end. In this strategy, first strand cDNA synthesis is performed 
using a modified random hexamer. The hexamer is designed to provide directionality. Second 
strand cDNA synthesis is performed by standard means and an adapter containing a Bpml and 
25 PvuII site is ligated onto both ends of the cDNA. The cDNA is ligated to a vector modified to 
delete all Bpml restriction sites. The vector containing the cDNA is transformed into a host cell 
to create a cDNA library and piasmid DNA is isolated and treated with Bpml, thereby creating 
individual linearized cDNAs containing both a 5' and 3 f transcript marker. The linearized cDNA 
is treated with T4 polymerase to blunt end, self-ligated and re-transformed to create a library. 
30 The piasmid DNA of the library is isolated and digested with PvuII to excise the nucleic acid 
containing non-contiguous transcript markers from both the 5' and 3' end. The excised transcript 
markers are concatenated and cloned to create a library containing a serial arrangement of non- 
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contiguous transcript markers from the 5' and 3' end which can be subjected to high-throughput 
nucleic acid sequence analysis. In another variation of this method, the vector is constructed to 
contain 5' and 3' Bpml sites at a cloning site thereby providing a means to excise the transcript 
marker from the cDNA. 
5 Figure 8 illustrates a strategy for obtaining non-contiguous transcript markers from 

random areas of a cDNA. In this strategy, mRNA is converted to cDNA following standard 
procedures. The cDNA is ligated to a vector that has been constructed to have restriction 
endonuclease sites for restriction endonucleases having 4 base pair recognition sites removed. 
The cDNA library is amplified and the plasmids prepared. The cDNA library is digested with a 
10 restriction endonuclease having a 4 base pair recognition site and the digested plasmid is purified. 
A 12 base pair adapter containing a Bcgl site, which digests nucleic acid within a range of its 
recognition site, is ligated onto the linearized cDNA, and the cDNA is transformed into a host 
cell. The cDNA library is digested with Bcgl resulting in the release of a 36 base pair fragment 
from each cDNA which originally contained the 4 base pair restriction endonuclease site. The 
15 fragments may be ligated into a vector directly or concatenated and ligated into a vector and 
subjected to sequence specific analysis. 

The ATM libraries of the present invention may contain a single transcript marker per 
individual clone or multiple transcript markers constructed in a serial arrangement. As illustrated 
in the figures, transcript markers may be concatenated, PCR amplified and then ligated into a 
20 vector or concatenated, ligated into a vector and then PCR amplified. 

In a preferred embodiment of the present invention, transcript markers that are excised 
from the cDNA library contain a cDNA portion and a synthetic adapter portion. In a preferred 
embodiment disclosed herein and as shown in Figure 2, the cDNA portion is at least 14 base pairs 
in length and the adapter portion is 6 base pairs which are designed to be asymmetric. The 
25 adapter portion provides the means for determining the sense orientation of the transcript marker 
in the vector as well as a means for determining the beginning of each transcript marker excised. 
Nucleic acid sequencing of the excised transcript markers can be performed to create a nucleic 
acid data set. For nucleic acid sequence analysis purposes, the nucleic acid adapter portion of the 
transcript marker can be subtracted or removed from the transcript marker nucleic acid data set. 
30 The ATM markers may exist in sense or antisense orientation in the vector. The presence 

of an nucleic acid fragment which provides directionality, such as an asymmetric adapter portion, 
provides the means for discerning the sense from the anti-sense stand and provides the means for 
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determining the beginning of each transcript marker. Figure 14A illustrates sequence data from a 
single clone containing an array of 6 transcript markers. The 6 bp "spacer" DNA sequences that 
distinguish one transcript marker from another are underlined. The spacer sequence "CTGGAG" 
indicates that the immediately adjacent 14 bp to the right is a transcript marker (sense strand). 
5 The spacer sequence "CTCCAG" indicates that adjacent 14 bp to the left is a transcript marker 
(antisense strand). Figure 14B provides a list of the six 14 bp transcript markers (sense strand) 
without the spacer DNA sequence derived from the anay in (14a). Asterisks (*) indicate 
sequences which are the reverse complement of the sequence actually found in the array. 
XY* Method for Nucleic Acid Sequencing 

10 Nucleic acid sequencing of transcript markers can be performed by any means known to 

those of skill in the art. Methods for cDNA sequencing employ such enzymes as the Klenow 
fragment of cDNA polymerase I, Sequenase® (US Biochemical Corp, Cleveland OH)), Taq 
polymerase (Perkin Elmer, Norwalk CT), thermostable T7 polymerase (Amersham, Chicago IL), 
or combinations of recombinant polymerases and proofreading exonucleases such as the 

15 ELONGASE Amplification System marketed by Gibco BRL (Gaithersburg MD). Preferably, the 
process is automated with machines such as the Hamilton Micro Lab 2200 (Hamilton, Reno NV), 
Peltier Thermal Cycler (PTC200; MJ Research, Watertown MA) and the AB1 377 cDNA 
sequencers (Perkin Elmer). 
V, P CR Methods 

20 Numerous PCR methods are known to those of skill in the art that would facilitate 

isolation, amplification and/or extension of nucleic acid sequences in the 5' or 3 1 direction. 
Gobinda et al (1993; PCR Methods Applic 2:318-22) disclose "restriction-site" polymerase chain 
reaction (PCR) as a direct method which uses universal primers to retrieve unknown sequence 
adjacent to a known locus. First, cDNA is amplified in the presence of primer to a linker 

25 sequence and a primer specific to the known region. The amplified sequences are subjected to a 
second round of PCR with the same linker primer and another specific primer internal to the first 
one. Products of each round of PCR are transcribed with an appropriate RNA polymerase and 
sequenced using reverse transcriptase. 

Inverse PCR can be used to amplify or extend sequences using divergent primers based 

30 on a known region (Triglia T et al (1988) Nucleic Acids Res 16:8186). Adapters are ligated onto 
cDNAs which then allow cDNAs to be circularized. The intramolecular ligation products then 
serve as PCR templates. The method uses several restriction enzymes to generate a suitable 
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fragment in the known region of a gene. The fragment is then circularized by intramolecular 
ligation and used as a PCR template. 

Capture PCR (Lagerstrom M et al (1991) PCR Methods Applic 1:11 1-19) is another 
method which may be used. The method involves PCR amplification of cDNA fragments 
5 adjacent to a known sequence in human and yeast artificial chromosome cDNA. Capture PCR 
also requires multiple restriction enzyme digestions and ligations to place an engineered 
double-stranded sequence into an unknown portion of the cDNA molecule before PCR. Another 
PCR method which may be used to retrieve sequences is that of Parker JD et al (1991; Nucleic 
Acids Res 19:3055-60). 

10 Capillary electrophoresis may be used to analyze the size or confirm the nucleotide 

sequence of sequencing or PCR products. Systems for rapid sequencing are available from 
Perkin Elmer, Beckman Instruments (Fullerton CA), and other companies. Capillary sequencing 
may employ flowable polymers for electrophoretic separation, four different fluorescent dyes 
(one for each nucleotide) which are laser activated, and detection of the emitted wavelengths by a 

15 charge coupled devise camera. Output/light intensity is converted to electrical signal using 
appropriate software (eg. Genotyper™ and Sequence Navigator™ from Perkin Elmer) and the 
entire process from loading of samples to computer analysis and electronic data display is 
computer controlled. Capillary electrophoresis is particularly suited to the sequencing of small 
pieces of cDNA which might be present in limited amounts in a particular sample. The 

20 reproducible sequencing of up to 350 bp of Ml 3 phage cDNA in 30 min has been reported 
(Ruiz-Martinez MC et al (1 993) Anal Chem 65 :285 1 -2858). 
YL Method for determining Genome Closure 

The present invention provides novel methods for achieving genome closure by 
combining rapid nucleic acid sequence identification of transcript markers from a cDNA with 

25 subsequent extension of the sequence of the identified transcript markers using PCR technology. 
Transcript markers constructed by the method illustrated in Figure 8 which have been 
concatenated and ligated to a vector provide for the sequence specific identification of 15-20 
transcript markers per vector. In one embodiment described herein, PCR technology is used to 
extend the sequence of a transcript marker in an outward direction as described in US Patent 

30 Application 08/487,1 12, filed June 7, 1995, specifically incorporated by reference. One primer is 
synthesized to initiate extension in the antisense direction (XLR) and the other is synthesized to 
extend sequence in the sense direction (XLF). Primers allow the extension of the known 
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sequence "outward" generating amplified nucleotide sequences containing new, unknown 

nucleotide sequence for the region of interest. As shown below, PCR primers which contain the 

sequence of a known adapter joining the non-contiguous transcript markers and all possible 

combinations of nucleotides for the 5 nucleotide positions flanking the adapter are used in PCR 

5 reactions to extend the sequence of the identified transcript markers using PCR technology. 

NNNNN 16basepairadapterNNNNN 

3' 

y 

Considering that for each N position, there could be 4 possible nucleotides, a total of 1024 

10 individual primers would be required for each 5' and 3' extension for a total of 1 ,048,576 

combinations which will be specific enough to amplify any DNA. 

CDNA libraries which have been prepared with oligo dT primers will allow for the 

identification of the 3 f most part of the cDNA and part of the coding region or the complete 5' 

most end of a cDNA. CDNA libraries constructed with random primers constructed in 
15 combination with techniques that result in a high representation of the 5* ends of cDNA, such as 

the Cap-Finder™ PCR construction kit (Clontech), will allow for the amplification of the 5 ! ends 

of genes. 

In another method illustrated in Figure 9, genome closure can be achieved by the 
identification of all transcribed genes from an mRNA source. As illustrated in Figure 9, step 1, a 

20 cDNA library containing a biotinylated poly A tail and having a Not I restriction site is 

constructed by standard means and cloned into a vector from which restriction endonuclease sites 
for 4 base pair restriction endonucleases have been removed. Multiple cDNA libraries derived 
from a variety of mRNA sources can be pooled to provide one sample. In Figure 9, step 2, the 
cDNA is digested with a restriction endonuclease which has a 4 base pair recognition site and the 

25 digested cDNA is captured by streptavidin-beads (Figure 9 step 3). The cDNA is digested with 
Not I which removes the biotin and ligated into a vector having a Type lis restriction 
endonuclease site for MboII (Figure 9, step 4). Other Type lis restriction endonucleases, such as 
Bpml, Bsgl Eco57I and BsmFI, can be used. The cDNA is digested with MboII which cuts 8 
base pairs into the cDNA. A known linker of sufficient size for PCR amplification is cloned into 

30 the Mbo II restriction site and the cDNA pools are subjected to PCR analysis using the primers 
described above. Generation of the PCR-extension products can either be sequenced directly or 
with a wobble primer, ie, a sequencing primer degenerate at the 3'-end to allow for all 3 possible 
bases following the poly A tail, or religated and cloned into a vector and then subjected to nucleic 
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acid sequencing. 

VIL Iran&EDBi Imaging 

Transcript imaging is a method for evaluating changes in gene expression caused by 
factors such as disease progression, pharmacologic treatment and aging. Transcript imaging is 
5 accomplished by sequencing several thousand clones from a particular tissue or cell type and 
electronically recording the abundance levels for each mRNA species identified. Electronic 
manipulations can then be done to examine which mRNAs are up- or down-regulated, or 
unchanged. 

Transcript imaging can be achieved by using transcript markers produced by the present 
10 invention. After nucleic acid sequencing of the transcript markers is accomplished, the 

abundance levels for each transcript marker are electronically recorded. Electronic manipulations 
can then be performed to examine which transcript markers are up- or down- regulated, or 
unchanged. 

INDUSTRIAL APPLICABILITY 
15 L Construction of cDNA Libraries 

The following example describes the construction of a cDNA library. The first step is to 
isolate mRNA from a desired biological tissue or cell source. The mRNA 
is then used in the synthesis of cDNA. 
RNA Isolation 

20 RNA is isolated using guanidinium isothiocyanate and 2-mercaptoethanol lysis, followed 

by ultracentrifugation over a cesium chloride gradient to obtain total RNA (Chirgwin et al). 
Alternatively, total RNA can be isolated using acid/phenol extraction (Chowzisky et al) and 
polyadenylated RNA can be isolated directly using a biotinylated oligo dT primer. An optical 
density measurement is taken to assess the quantity of total RNA isolated, and an aliquot is run 

25 on an electrophoresis gel to assess the quality and integrity of the RNA. The RNA is then stored 
until needed at -80 °C, which prevents degradation. 

In order to obtain cleaner total RNA, each sample is treated with DNAse and acid phenol, 
followed by precipitation and washing. The RNA is again run on an electrophoresis gel to make 
sure it is free of genomic DNA contamination. Subsequent selection of polyadenylated (poly A) 

30 RNA is done with either an oligo(dT)-based affinity column or Oligotex™ latex microspheres. 
The quality of the isolated mRNA is checked confirmed, and the sample is used in cDNA library 
construction. 



-17- 



WO 98/14619 PCT/US97/18344 

rrtN A I ihrflrv Con struction 

Synthesis of first-strand cDNA is initiated using a poly(dT) primer that is complementary 
to the polyA stretch at the 3' end of most transcripts; and the reverse transcriptase enzyme. The 
primer used in this reaction contains a restriction enzyme recognition site (NotI) that permits 

5 directional insertion into an appropriate cloning vector. Second-strand cDNA synthesis is based 
on the method developed by Gubler and Hoffman (1983). RNAse H nicks the RNA/cDNA 
hybrid created in the reverse transcription reaction, creating priming sites for E. coli DNA 
polymerase to synthesize second-strand cDNA. The gaps in the second strand are ligated 
together using E. coli DNA ligase. 

10 After the ends of the cDNA are blunted with T4 or Pfu DNA polymerase, an adaptor is 

ligated to the double-stranded cDNA. This oligonucleotide, which contains an EcoRI-compatible 
sticky end, allows for directional cloning on the cDNA once digestion is complete with NotI, the 
restriction enzyme site found at the 3' terminus of the cDNA. The cDNA is then size-fractionated 
to remove very short cDNAs, which would inhibit the generation of highly complex libraries. At 

15 this point, the cDNA is ligated into a plasmid vector system and transformed into bacterial cells 
for propagation. 

H. rwtmction of ATM cDNA libr aries 

Three cDNA libraries were constructed containing an adapter having the Type lis 
restriction endonuclease, Bpm I, at the 5' end of each cDNA insert. 
20 Two ug each of poly A + RNA from human colon, human prostate, and a single species 

control RNA (bacterial chloramphenicol transferase) were reverse transcribed using an oligo-dT 
primer and Superscript reverse transcriptase according to manufacturer's instructions (Gibco 
Superscript Plasmid System). Following second strand synthesis, 3 ug of the adapter containing 
the Bpm I site were ligated to each cDNA. 

25 The adapter containing the Bpm I site was prepared previously in the following manner: 

two oligonucleotides (5' AATTCAGCTGGAG and 5' phos-CTCCAGCTG) were synthesized 
and purified by HPLC and polyacrylamide gel electrophoresis (New England Biolabs). 
Equimolar amounts of the two oligos were combined in annealing buffer (20 mM Tris, pH 7.4, 2 
mM MgCl 2 , 50 mM NaCl), boiled, and allowed to cool to <30 degrees C in a heating block. 

30 Following adapter ligation, the cDNA was digested with Not I, fractionated over a Sepharose CL- 
4B column and cloned into a pSPORT vector (LTI, Inc.). 

To create an ATM library we followed the procedure as outlined in ATM Strategy 1, as 



-18- 



WO 98/14619 PCT/US97/18344 

illustrated in Figure 2, , starting with prostate mRNA containing the Bpm I adapter described 
above. This library was transformed into E. coli strain DH10B by electroporation and allowed to 
grow either on LB plates or LB broth each supplemented with carbenicillin. Transformed cells 
were collected either by scraping (plates) or by centrifugation (media) and plasmid DNA was 

5 isolated (Promega or Qiagen systems). Greater than 100 ug of plasmid DNA was then digested 
with Bpm I and Pvu II. Multiple aliquots of 20 ug of plasmid were digested for 1.5 hours using 
14 U of Bpm I (New England Biolabs) in a 150 ul volume at 37 degrees C. Thirty units of Pvu II 
(New England Biolabs) were then added and digestion proceeded for an additional hour at 37 
degrees C. The digested DNA was fractionated on 20% TBE acrylamide. A 20 bp band was * 

10 excised, the DNA was recovered by electrocution or by incubating the crushed gel slices in 
tris/EDTA overnight. The transcript markers were treated with T4 DNA polymerase and cloned 
either directly into pSPORT or concatenated. 
2. Analysis of ATM libraries 

ATM libraries were made according to ATM Strategy 1, illustrated in Figure 2. Vector 

15 background of these libraries was determined by screening randomly selected clones from the 
respective libraries by restriction enzyme digestion of plasmids with Bpm I and PvuII, thereby 
releasing transcript marker inserts (n= number of clones screened). Library size is determined by 
subtracting the background from the number of individual colonies or transformants that are 
generated from a single ligation reaction of the transcript markers to pSport vector. Average 

20 number of markers per clone as shown in Table I is determined by both insert size and DNA 
sequence verification. 

TABLE I 

library vector background library size avgJmarkers/clone 

Prostate 1 10%(n=10) 1.1 x 10 6 3.8 

25 Prostate 2 0%(n=12) 0.95 xlO 6 4.9 

Prostate 3 0% (n=12) 0.12 x 10 6 3.8 

Arrays of transcript markers from the prostate ATM library were PCR amplified forjfl 
cycles using Ml 3 forward and reverse primers. These arrays were size selected on a 6% 
acrylamide gel and cloned into a pSPORT-derived vector. Twenty-seven random clones were 
30 sequenced from a total of 1 3 1 total transcript markers analyzed. There was an average of 4.9 
markers per clone with a range of 4-7 markers/clone. Table II illustrates the average marker size 
per number of clones. 
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TABLE II 



marker size 


number (%) 


12 


2(1.5) 


13 


6(4.6) 


14 


102 (77.9) 


15 


14 (10.7) 


16 


1 (0.8) 


14/15 


6 (4.6) 



10 The marker size "14/15" refers to an ambiguous situation where there are 29 bp (instead of 28 bp) 
with markers which are concatenated back to back as shown: 

spacer | 29 bp-two markers back to back — 1-| spacer 

CTGGAGN NNNNNNNNW 
Because the frequency of 13 and 16 bp markers is much lower than 14 and 15 bp markers. 
15 the 29 bp between the spacer sequences are assumed to be composed of one 14 bp and one 1 5 bp 
marker, rather than one 13 bp and one 16 bp marker. Thus, the true total number of 14 bp 
markers was 105 (102+3; 80.2%) and the true total number of 15 bp markers was 17 (14+3; 
13%). 

III. Isolation of a full length cDNA 

20 A transcript marker (5' ccgagaptcgtcgg) was identified from a prostate ATM library 

which corresponds to the gene apoferritin H (GenBank GI numbers: g31340, g31342, g28434). 
Based on the published sequence, this marker is present at the 5' end of the mRNA, 14 
nucleotides downstream from the start of transcription and 181 nucleotides upstream of the start 
of the coding sequence. The entire mRNA is predicted to be about 0.9 kb. 

25 As illustrated below, to isolate the apoferritin H cDNA from the ATM library, a gene 

specific primer ("g31340") that contains 7 nucleotides of the Bpm I adapter and the 14 
nucleotides of the transcript marker (5' gctggagccgagagtcgtcgg) was designed. The cDNA 
inserts from 1 ug of the prostate ATM library were amplified by 30 cycles of PCR using the M13 
forward and reverse primers. This PCR reaction was diluted 1 :50 in water and 1 ul was 
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reamplified for 33 cycles using the nested T7 and u g31340" primers. A 0.9 kb band was isolated, 
gel purified and cloned. DNA sequencing confirmed the identity of the cloned gene as full length 
apoferritin H. Alternatively, the 0.9 kb PCR product is sequenced directly with appropriate 
primers. 

5 Ml 3 forward Ml 3 reverse 

"g31340" "~ ( T7 



5' vector 

10 IV. High Throughput Isolation of cDNA clones 

High throughput isolation of cDNA clones from an ATM library is achieved in the 
following manner. First, a master pool of insert cDNA is created from an ATM library by PCR 
amplification using primers found in the vector (e.g., Ml 3 forward and reverse). Second, gene 
specific primers are synthesized in 96 well arrays (Gibco). Third, aliquots of gene specific 

15 primers, PCR reagents, and the master cDNA pool are aliquoted to 96 PCR wells for PCR and 
subsequently analyzed by gel electrophoresis. In addition, an initial screen for successful PCR 
reactions is accomplished by doing real time flourescent detection of PCR products. With this 
technique only those reactions which give a significant fluorescent signal above background 
would then be analyzed by gel electrophoresis. 

20 V, Extension of Transcript Markers to Full Length 

The nucleic acid sequence of transcript markers can be used to design oligonucleotide 
primers for extending a partial nucleotide sequence to full length or for obtaining 5' sequences 
from genomic libraries. One primer is synthesized to initiate extension in the antisense direction 
(XLR) and the other is synthesized to extend sequence in the sense direction (XLF). Primers 

25 allow the extension of the known sequence "outward" generating amplified nucleotide sequences 
containing new, unknown nucleotide sequence for the region of interest (US Patent Application 
08/487,1 12, filed June 7, 1995, specifically incorporated by reference). The initial primers are 
designed from the cDNA using OLIO® 4.06 Primer Analysis Software (National Biosciences), or 
another appropriate program, to be 22-30 nucleotides in length, to have a GC content of 50% or 

30 more, and to anneal to the target sequence at temperatures about 68°-72° C. Any stretch of 
nucleotides which would result in hairpin structures and primer-primer dimerizations is avoided. 
The original, selected cDNA libraries, or a human genomic library, is used to extend the 
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sequence. A genomic library is most useful to obtain 5' upstream regions. If more extension is 
necessary or desired, additional sets of primers are designed to further extend the known region. 

By following the instructions for the XL-PCR kit (Perkin Elmer) and thoroughly mixing 
the enzyme and reaction mix, high fidelity amplification is obtained. Beginning with 40 pmol of 
5 each primer and the recommended concentrations of all other components of the kit, PCR is 
performed using the Peltier Thermal Cycler (PTC200; MJ Research, Watertown MA) and the 
following parameters: 

Step 1 94° C for 1 min (initial denaturation) 

Step 2 65° C fori min 

10 Step 3 68° C for 6 min 

Step 4 94° C for 15 sec 

Step 5 65° C fori min 

Step 6 68° C for 7 min 

Step 7 Repeat steps 4-6 for 1 5 additional cycles 

15 Step 8 94° C for 15 sec 

Step 9 65° C fori min 

Step 10 68° C for 7:15 min 

Step 1 1 Repeat step 8-10 for 1 2 cycles 

Step 12 72° C for 8 min 

20 Step 13 4° C (and holding) 

A 5-10 yul aliquot of the reaction mixture is analyzed by electrophoresis on a low 
concentration (about 0.6-0.8%) agarose mini-gel to determine which reactions were successful in 
extending the sequence. Bands thought to contain the largest products were selected and cut out 
25 of the gel. Further purification involves using a commercial gel extraction method such as 
QIAQuick™ gel extraction (QI AGEN Inc). After recovery of the DN A, Klenow enzyme was 
used to trim single-stranded, nucleotide overhangs creating blunt ends which facilitate religation 
and cloning. 

After ethanol precipitation, the products are redissolved in 13 ^1 of ligation buffer, 1^1 
30 T4-DNA ligase (15 units) and 1^1 T4 polynucleotide kinase are added, and the mixture is 
incubated at room temperature for 2-3 hours or overnight at 16° C. Competent IL coli cells (in 
40 /A of appropriate media) are transformed with 3 iA of ligation mixture and cultured in 80 /il of 
SOC medium (Sambrook J et al, supra). After incubation for one hour at 37° C, the whole 
transformation mixture is plated on Luria Bertani (LB)-agar (Sambrook J et al, supra) containing 
35 2xCarb. The following day, several colonies are randomly picked from each plate and cultured in 
150 m1 of liquid LB/2xCarb medium placed in an individual well of an appropriate, 
commercially-available, sterile 96-well microtiter plate. The following day, 5 fA of each 
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overnight culture is transferred into a non-sterile 96-weIl plate and after 'dilution 1:10 with water, 

5 /il of each sample is transferred into a PCR array. 

For PCR amplification, 18 /^l of concentrated PCR reaction mix (3.3x) containing 4 units 

of rTth DNA polymerase, a vector primer and one or both of the gene specific primers used for 

5 the extension reaction are added to each well. Amplification is performed using the following 

conditions: 

Stepl 94° C for 60 sec 

Step 2 94° C for 20 sec 

Step 3 55° C for 30 sec 

10 Step 4 72° C for 90 sec 

Step 5 Repeat steps 2-4 for an additional 29 cycles 

Step 6 72° C for 180 sec 

Step 7 4° C (and holding) 

Aliquots of the PCR reactions are run on agarose gels together with molecular weight 
15 markers. The sizes of the PCR products are compared to the original partial cDNAs, and 
appropriate clones are selected, ligated into plasmid and sequenced. 
VI. 5 - aza-2* Deoxycvtidine Treatment of Cells 

5-aza-2'-deoxycytidine induces transcription of silent genes, presumably by 
demethylating cytosines in CpG islands which are regulatory regions located upstream of most 
20 genes, btaining libraries from cells treated with 5-aza-2'-deoxycytidine will enhance the 
discovery of rare genes and genes expressed in specialized cell types difficult to isolate and 
prepare RNA from. 

Methods 

THP1 cells at a density of 1 .1 million cells per ml were treated for three days with 0.8 
25 micromolar 5-aza-2'-deoxycytidine. The medium used for growth conditions was Iscove's 
modified DMEM with 10% Fetal Bovine Serum. 

HNT precursor cells at 80% confluency were treated for three days with 0.35 micromolar 
5-aza-2'-deoxycytidine. The medium used for growth conditions was Iscove's modified DMEM. 

Because 5-aza-2'-deoxycytidine has been shown to be toxic in cultured cells and animal, 
30 initial experiments were conducted to assess the toxicity of S-aza^'-deoxycytidine on hNT and 
THP1 cells to establish conditions where the cells would survive and RNA could be recovered. 
Around 1 micro molar 5-aza-2 r -deoxycytidine is a concentration typically used to induce silent 
gene transcription. A concentration of 0.8 micro molar was selected for THP1 cells and 0.35 
micro molar for hNT cells. 
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Results 

Northern analysis to measure the RNA levels of two identified genes was performed on 
the cells to verify that 5-aza-2'-deoxycytidine was inducing gene transcription under the 
conditions used. Significant induction of both identified genes by 5-aza-2'-deoxycytidine was 

5 obtained in the hNT cells. Many genes were found to be induced by 5-aza-2'-deoxycytidine in 
both THP 1 cells and hNT cells. 

The following 3 genes were in the cDNA library constructed from mRNA from THP cells 
treated with 5-aza-2'deoxycytidine library and not in the control cDNA library: enBank g29382, 
human BBC1 mRNA; GenBank g337507, human ribosomal protein S25 mRNA; and gl 84553 

10 human insulinoma pig-analog mRNA. 

The following genes were in the cDNA library constructed from mRNA from THP cells 
treated with 5-aza-2'deoxycytidine library and in the control cDNA library and found 
upregulated in THP1 : GenBank gl 82055 human neutrophil elastase mRNA, 3' end; g36143 
human mRNA for ribosomal protein SI 1; g793842 human mRNA for ribosomal protein L29; 

15 g28976 human mRNA for azurocidin; and g34891 1 human glycoprotein mRNA. 

The following 3 genes were in the cDNA library constructed from mRNA from hNT cells 
treated with 5-aza-2'deoxycytidine library and not in the control cDNA library: gl 90233 human 
acidic ribosomal phosphoprotein PI; g436217 human mRNA (KIAA0037) for ORF; and 
g385936 hinge=OXPHOS system complex III. 

20 An additional result of 5-aza-2'-deoxycytidine treatment was a decrease in the expression 

of many genes, particularly more abundant mRNAs. 
VII. Construction of the Plasmid pIIEZ-1 

Plasmid pIIEZ-1 was derived from pUC19 vector by removing the 991 base pairs from 
the restriction endonuclease sites Sspl through Afl3. A 90 base pair synthetic polylinker 

25 containing type lis restriction endonuclease sites were ligated to the remaining pUC19 vector, as 
shown in Figure 13. 

The type lis polylinker was made by annealing the two synthetic oligomers P-II-1 and P- 
II-2. The oligomer sequences are: 
P-II-1 
30 5' 

CATGTGCGGCCGCGGCGCGCCGTCAGCTGCACAGATGCGAGCTCCAGGCATTCATCC 
TTAAGTACTTCAGTAGACGTCCCTGCAGGTGAATTC 
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P-II-2 
5' 

GAATTCACCTGCAGGGACGTCTACTGAAGTACTTAAGGATGAATGCCTGGAGCTCGC 
ATCTGTGCAGCTGACGGCGCGCCGCGGCCGCA 
5 After annealing, the double stranded linker has a 5' blunt end which is ligated into the 

Sspl restriction endonuclease site and a 4 bp (5' CATG) extension which is complimentary to the 
Afl3 restriction endonuclease site. 

As illustrated in Figure 13, the poly linker contains 4 type Us restriction endonuclease 

> 

sites which can be used for generation of 14 bp transcript markers. 
10 Standard PCR primer directed site specific mutagenesis is carried out to eliminate all type 

lis restriction endonuclease sites and all 4 base pair restriction endonuclease sites. 

VIIL Genome Closure 

A method directed toward genome closure involves the systematic identification of all 

transcribed genes. The method involves the generation of cDNA's with a defined 5'-end. 
15 generation of defined priming sites for extension PCR amplification from within the cloned 

cDNAs, systematic amplification of all cDNAs, and sequencing of all extension products. 

Generation of cPNA's with a defined S'-end 

As illustrated below, after conversion of mRNA into cDNA with standard methods, the 

cDNA's are digested with a 4 base pair cutter. The cDNAs are ligated into the vector displaying 
20 a type lis restriction endonuclease site at one end. 

1) Generation of cDNAs 

♦Biotin 

cDNA TTTTTTTTTTTTTT NotI 

AAAAAAAAAAAAAA 

25 

2) Digestion with Hpa II 

♦Biotin 

V cDNA V TTTTTTTTTTTTTT NotI 

30 V -V AAAAAAAAAAAAAA 

3) Capture of cDNA's with Streptavidin-beads 
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♦Biotin 

CGG TTTTTTTTTTTTTT NotI 

C AAAAAAAAAAAAAA 

5 

4) Ligation of adapter/linker 

adapter/linker 

♦Biotin 

10 TCGACTAGTACGAAGA CGG TTTTTTTTTTTTTT NotI Y 

GATCATGCTTCTGC C AAAAAAAAAAAAAA bead 

*Biotin 

TCGACTAGTACGAAGACGG TTTTTTTTTTTTTT NotI Y 

15 GATCATGCTTCTGCC AAAAAAAAAAAAAA bead 

5) Not I digest, ligation into pUC 1 8 vector (Not I/Sal I) 

— V TTTTTNotI 

I I 
20 vector 

6) Digestion with Bbs I, cuts 6bp 
Bbs I (digest with Bbs ] 

25 GAAGACGC 

MUM Hill 

CTTCTGCGNNNN 

I 

I 

30 | 

Sail vector- 



Y 

bead 



into cDNA 
[) 



NNNNNNNNN 



NNNNN 



-NotI 
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7) Bbs I (fill in with Klenow) 



GAAGACGCNNNN 



NNNNNNNNN 



llllllllllll 



lllllllll 



5 



CTTCTGCGNNNN 



NNNNNNNNN 



Sail 



-vector- 



NotI 



10 



Generation of defined pri minp sites for extension PCR amplification from within the 
cloned cDNAs. 

The cDNAs are cloned into a vector, amplified and digested with the type lis restriction 
15 endonuclease. Linkers are cloned into the linearized plasmid/cDNA fragment. 
8)BbsI CATGATCATGG 

mum 

GTACTAGTACCGG 

20 9)Ligate on linkers 

GAAGACGCNNNNCATGATCATGG GGCCATGATCATGNNNNNNNNN 

lllllllllllllllllllllll llllllllllllllllllll 

CTTCTGCGNNNNGTACTAGTACCGG GGTACTAGTACNNNNNNNNN 



30 1 0)Treat with T4-DNA polymerase in presence of (dA, dT) 

a) GAAGACGCNNNNCATGATCAT GGCCATGATCATGNNNNNNNNN 



25 



Sail 



-vector- 



--Not! 



Illllllllllllllllllll 



llllllllllll 



mini 
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CTTCTGCGNNNNGTACTAGTACCGG TACTAGTACNNNNNNNNN 



Sail vector NotI 



Sy^matic amplifica tion of all cDNAs 

Extension PCR as described in Example V with combinations of defined primer sets (up to 
1 0-fold degeneracy can be achieved, depending on the type lis enzyme selected in the step 4 
10 adapter). Therefore selective amplification of every cDNA present in the library can be achienved 
in a systematic manner. 



ATCATGGCCATGATCATGN 4 -> 
GAAGACGChlNhWCATGATCATGGCCATGATCATGhJN^^ 

15 lllllllllllllllllllllllllllllllllllllllllllllllllll 

CTTCTGCGNNNNGTACTAGTACCGGTACTAGTACNNNNNNNN^ 

| <- 4 NGTACTAGTACCGGTACTA I 

I I 

I I 
20 Sail vector NotI 

Since NNNN and NNNN are symmetrical combinations for primer pairs are given (e.g. 
5' NNNN 3' = 5' AGCA 3' then 



25 5'ATCATGGCCATGATCATGAGCA 3' will form a pair with 
5'ATCATGGCCATGATCATGTCGT 3' 

Any PCR-reaction will cover 2 primer combinations. 



30 combination 1: 

5'ATCATGGCCATGATCATGAGCA 3' 
5'ATCATGGCCATGATCATGTCGT 3' 
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and 

combination 2: 

5'ATCATGGCCATGATCATGAGCA 3* 
5'ATCATGGCCATGATCATGTCGT 3' 

5 

Exam ple of selective amplification with a 4 fold degenerate primer set resulting in 128 
PQolg of cPNAs 

To cover all possible combinations of 4 bases, 256 oligonucleotides will be 
synthesized (5'ATCATGGCCATGATCATGNNNN 3'). Since every PCR-reaction covers 

10 2 combinations, 128 reactions will cover all possible combinations. A range of 200-300 
PCR-products/reaction will be expected for a specific tissue. Since this is an inverse PCR 
approach, the products are sequenced directly with the wobble primer, increasing the 
resolution by a factor 3. The PCR-products then are directly thermo-cycled with the 
wobble primers resulting in a higher resolution. One colour is choosen for each nucleotide 

15 with no terminator included in the reaction. The resulting products are run on a sequencing 
gel. 

All publications and patents mentioned in the above specification are herein 
incorporated by reference. Various modifications of the described method and system of 
the invention will be apparent to those skilled in the art without departing from the scope 

20 and spirit of the invention. Although the invention has been described in connection with 
specific preferred embodiments, it should be understood that the invention as claimed 
should not be unduly limited to such specific embodiments. Indeed, various modifications 
of the above-described modes for carrying out the invention which are obvious to those 
skilled in the field of molecular biology or related fields are intended to be within the scope 

25 of the following claims. 
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SEQUENCE LISTING 

(1) GENERAL INFORMATION 

(i) APPLICANT: INCYTE PHARMACEUTICALS, INC. 

(ii) TITLE OF THE INVENTION: METHODS FOR GENERATING AND ANALYZING 
TRANSCRIPT MARKERS 

(iii) NUMBER OF SEQUENCES: 50 

(iv) CORRESPONDENCE ADDRESS: 

(A) ADDRESSEE: Incyte Pharmaceuticals, Inc. 

(B) STREET: 3174 Porter Drive 

(C) CITY: Palo Alto 

(D) STATE: CA 

(E) COUNTRY: U.S. 

(F) ZIP: 94304 

(v) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: Diskette 

(B) COMPUTER: IBM Compatible 

(C) OPERATING SYSTEM: DOS 

(D) SOFTWARE: FastSEQ Version 1.5 

(vi) CURRENT APPLICATION DATA: 

(A) PCT APPLICATION NUMBER: To Be Assigned 

(B) FILING DATE: Filed Herewith 

(vii) PRIOR APPLICATION DATA: 

(A) APPLICATION NUMBER: US 08/723,646 

(B) FILING DATE: 03-OCT-1996 

(viii) ATTORNEY/AGENT INFORMATION: 

(A) NAME: Billings, Lucy J. 

(B) REGISTRATION NUMBER: 36,749 

(C) REFERENCE/ DOCKET NUMBER: IN-0001 PCT 

(ix) TELECOMMUNICATION INFORMATION: 

(A) TELEPHONE: 650-855-0555 

(B) TELEFAX: 650-845-4166 

(2) INFORMATION FOR SEQ ID NO:l: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 13 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS : single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:l: 
AATTCAGCTG GAG 

(2) INFORMATION FOR SEQ ID NO: 2: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 9 base pairs 

(B) TYPE: nucleic acid 
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(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2: 
CTCCAGCTG 

(2) INFORMATION FOR SEQ ID NO: 3: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 41 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3: 
CTGGAGNNNN NNNNNNNNNN NNNNNNNNNN NNNNNCTCCA G 
(2) INFORMATION FOR SEQ ID NO: 4: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(vii) IMMEDIATE SOURCE: 

(A) LIBRARY: GenBank 

(B) CLONE: 31340, 31342, 28434 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4: 
CCGAGAGTCG TCGG 

(2) INFORMATION FOR SEQ ID NO: 5: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 21 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 
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(vii) IMMEDIATE SOURCE: 

(A) LIBRARY: GenBank 

(B) CLONE: 31340 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5: 
GCTGGAGCCG AGAGTCGTCG G 

(2) INFORMATION FOR SEQ ID NO: 6: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 13 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS : single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6: 
AATTCAGCTG GAG 

(2) INFORMATION FOR SEQ ID NO: 7: 

(i) - SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 35 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:7: 
AATTCAGCTG GAGNNNNNNN NNNNNNNNNC AGCTG 

(2) INFORMATION FOR SEQ ID NO: 8: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 31 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:8: 
GTCGACCTCN NNNNNNNNNN NNNNNGTCGA C 

(2) INFORMATION FOR SEQ ID NO: 9: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 16 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 9: 
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CCCGAGGTCG ACTTAA 

(2) INFORMATION FOR SEQ ID NO: 10: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 
(vi) ORIGINAL SOURCE: 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 10: 
CTGGAGNNNN NNNNNNNNNN NN 

(2) INFORMATION FOR SEQ ID NO: 11: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 11: 
GACCTCNNNN NNNNNNNNNN 

(2) INFORMATION FOR SEQ ID NO: 12: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH : 21 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 12 
GTTGAATACT CATACTCTTC C 

(2) INFORMATION FOR SEQ ID NO: 13: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs , 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 13 
GCTGGCCTTT TGCTCACATG 

(2) INFORMATION FOR SEQ ID NO: 14: 
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(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 60 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 14: 
GAATTCACCT GCAGGGACGT CTACTGAAGT ACTTAAGGAT GAATGCCTGG AGCTCGCATC 
(2) INFORMATION FOR SEQ ID NO: 15: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 29 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 15: 
TGTGCAGCTG ACGGCGCGCC GCGGCCGCA 

(2) INFORMATION FOR SEQ ID NO: 16: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 120 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 16: 

CAACCACCCG GGCCCTCCAG CTGGAGGAAA AAATGCTAGG CTGGAGGGCT GATGTTTTCC 
CTGGAGCTAG TTCTAGATCG CTGGAGCTGC GCCCGGCCCG GGGACCGGCT ATCCCTCCAG 

(2) INFORMATION FOR SEQ ID NO: 17: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 17: 
GGCCCGGGTG GTTG 

(2) INFORMATION FOR SEQ ID NO: 18: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 
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(D) TOPOLOGY: linear 
(ii) MOLECULE TYPE: cDNA 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 18: 
GAAAAAATGC TAGG 

(2) INFORMATION FOR SEQ ID NO: 19: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS : single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 19; 
GGCTGATGTT TTCC 

(2) INFORMATION FOR SEQ ID NO:20: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 20 
CTAGTTCTAG ATCG 

(2) INFORMATION FOR SEQ ID NO: 21: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

. (xi) SEQUENCE DESCRIPTION: SEQ ID NO:21 
CTGCGCCCGG CCCG 

(2) INFORMATION FOR SEQ ID NO: 22: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:22 
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GGATAGCCGG TCCC 

(2) INFORMATION FOR SEQ ID NO:23: 

<i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 93 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 23: 

CATGTGCGGC CGCGGCGCGC CGTCAGCTGC ACAGATGCGA GCTCCAGGCA TTCATCCTTA 60 
AGTACTTCAG TAGACGTCCC TGCAGGTGAA TTC 93 



(2) INFORMATION FOR SEQ ID NO: 24: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 89 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:24: 

GAATTCACCT GCAGGGACGT CTACTGAAGT ACTTAAGGAT GAATGCCTGG AGCTCGCATC 
TGTGCAGCTG ACGGCGCGCC GCGGCCGCA 

(2) INFORMATION FOR SEQ ID NO: 25: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 25: 

TTTTTTTTTT TTTT 

(2) INFORMATION FOR SEQ ID NO: 26: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS : single 

(D) TOPOLOGY: linear 



(ii) MOLECULE TYPE: cDNA 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 26: 
AAAAAAAAAA AAAA 

(2) INFORMATION FOR SEQ ID NO: 27: 
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(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 16 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:27: 
TCGACTAGTA CGAAGA 

(2) INFORMATION FOR SEQ ID NO: 28: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 14 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 28 
CGTCTTCGTA CTAG 

(2) INFORMATION FOR SEQ ID NO: 29: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 12 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 29 
GAAGACGCNN NN 

(2) INFORMATION FOR SEQ ID NO: 30: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 12 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 30 
NNNNGCGTCT TC 

(2) INFORMATION FOR SEQ ID NO: 31: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single' 

(D) TOPOLOGY: linear 
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(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 31: 
CATGATCATG G 

(2) INFORMATION FOR SEQ ID NO: 32: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 13 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 32: 
GGCCATGATC ATG 

(2) INFORMATION FOR SEQ ID NO: 33: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 23 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 33: 
GAAGACGCNN NNCATGATCA TGG 

(2) INFORMATION FOR SEQ ID NO: 34: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 25 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 34 
GGCCATGATC ATGNNNNGCG TCTTC 

(2) INFORMATION FOR SEQ ID NO: 35: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 35 
GGCCATGATC ATGNNNNNNN NN 
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(2) INFORMATION FOR SEQ ID NO: 36: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS : single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 36: 
NNNNNNNNNC ATGATCATGG 

(2) INFORMATION FOR SEQ ID NO: 37: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 21 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 37: 
GAAGACGCNN NNCATGATCA T 

(2) INFORMATION FOR SEQ ID NO: 38: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 25 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 38: 
GGCCATGATC ATGNNNNGCG TCTTC 

(2) INFORMATION FOR SEQ ID NO: 39: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:39: 
GGCCATGATC ATGNNNNNNN NN 

(2) INFORMATION FOR SEQ ID NO: 40: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 18 base pairs 

(B) TYPE: nucleic acid 
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(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 40: 
NNNNNNNNNC AT GAT CAT 

(2) INFORMATION FOR SEQ ID NO: 41: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 41: 
ATCATGGCCA TGATCATGNN NN 

(2) INFORMATION FOR SEQ ID NO: 42: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH : 51 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 42: 
GAAGACGCNN NNCATGATCA TGGCCATGAT CATGNNNNNN NNNNNNNNNN N 
(2) INFORMATION FOR SEQ ID NO: 43: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 51 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:43: 
NNNNNNNNNN NNNNNNNCAT GATCATGGCC ATGATCATGN NNNGCGTCTT C 
(2) INFORMATION FOR SEQ ID NO: 44: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 
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(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 44: 
ATCATGGCCA TGATCATGNN NN 

(2) INFORMATION FOR SEQ ID NO: 45: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS : single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 45: 
ATCATGGCCA TGATCATGAG CA 

(2) INFORMATION FOR SEQ ID NO: 46: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 46: 
ATCATGGCCA TGATCATGTC GT 

(2) INFORMATION FOR SEQ ID NO: 47: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 47: 
ATCATGGCCA TGATCATGAG CA 

(2) INFORMATION FOR SEQ ID NO: 48: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 48: 
ATCATGGCCA TGATCATGTC GT 

(2) INFORMATION FOR SEQ ID NO: 49: 
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(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 49: 
ATCATGGCCA TGATCATGAG CA 

(2) INFORMATION FOR SEQ ID NO: 50: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 50: 
ATCATGGCCA TGATCATGTC GT 
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Claims: 

1 . A method for generating and analyzing transcript markers from the 5' most end 
of individual cDNAs of a cDNA library comprising the steps of: 

a) obtaining a cDNA library comprising individual cDNAs having a first 
5 restriction endonuclease site for a restriction endonuclease that digests the cDNA at the 5' 

most end within an expected distance from its recognition site and a second endonuclease 
restriction site; 

b) subjecting the cDNAs to digestion with the first and the second 
restriction endonuclease thereby excising transcript markers from the 5' most end of the 

10 individual cDNAs; 

c) ligatingsaid transcript markers to a vector; 

d) transforming said vector containing the transcript markers in a host cell 
and culturing said host cells; and 

e) performing nucleic acid sequence analysis of the transcript markers. 

15 2. The method of Claim 1 further comprising the step of creating blunt ends on the 

5' transcript markers after step b. 

3. The method of Claim 1 wherein said first restriction enzyme is a Type IIS 
restriction enzyme. 

4. The method of Claim 3 wherein the Type IIS restriction enzyme is selected from 
20 the group consisting of Bpml, Bsgl, Eco57I and BsmFI. 

5. The method of Claim 1 wherein said cDNA library has been constructed by 
normalization techniques. 

6. The method of Claim 1 wherein said cDNA library has been constructed using 
random primers. 

25 7. The method of Claim 1 wherein said cDNA library has been constructed using 

olio dT primers. 

8. The method of Claim 1 wherein the cDNA library has been constructed from 
mRNA treated with a demethylating agent. 

9. The method of Claim 8 wherein the demethylating agent is 5-aza-2' 
30 deoxycytidine. 

10. The method of Claim 1 wherein digesting the cDNA library with the second 
restriction enzyme creates a blunt end. 
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11. The method of Claim 1 optionally comprising the step of concatenating said 5' 
transcript markers after step b) thereby forming a serial arrangement of multiple 5' 
transcript markers. 

12. The method of Claim 1 1 optionally comprising the step of amplifying the serial 
5 arrangement of multiple 5' transcript markers by polymerase chain reaction prior to ligating 

to a vector. 

13. A method for determining a representation of gene expression in a cDNA 

library comprising the steps of: 

a) obtaining a cDNA library comprising individual cDNAs having a first 
10 restriction endonuclease site for a restriction endonuclease that digests the cDNA at the 5' 

most end within an expected distance from its recognition site and a second endonuclease 
restriction site and wherein the cDNA library has been grown under non-amplified growth 
conditions; 

b) generating and isolating 5' transcript markers from the cDNA library 
15 wherein each isolated 5' transcript marker identifies an individual cDNA; 

c) ligating the isolated 5' transcript markers to a vector and transforming the 

vector in a host cell; 

d) culturing said host cells; 

e) performing nucleic acid sequencing on the 5' transcript markers in said 
20 transformed host cells to create a nucleic acid data set; and 

f) subjecting the data set to analysis to determine the relative abundance of 
individual 5' transcript markers thereby determining a representation of gene expression in 
the cDNA library. 

14. The method of Claim 13 comprising concatenating the isolated 5' transcript 

25 markers after step b) thereby creating a serial arrangement of multiple 5* transcript markers. 

15. The method of Claim 13 wherein said cDNA library has been constructed with 
olio dT primers. 

16. The method of Claim 13 wherein after step b) the isolated 5* transcript markers 
are blunt ended. 

30 1 7. A method for rapid nucleic acid sequence analysis of cDNA comprising the 

steps of: 

a) obtaining a cDNA library comprising individual cDNAs having a first 
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restriction endonuclease site for a restriction endonuclease that digests the cDN A at the 5' 
most end within an expected distance from its recognition site and a second endonuclease 
restriction site; 

b) subjecting the cDNA library to digestion with the first restriction 

5 endonuclease and the second restriction endonuclease to create 5' transcript markers which 
identify individual cDNAs; 

c) isolating said 5' transcript markers; 

d) concatenating said 5' transcript markers to create a serial arrangement of 

multiple transcript markers; 
JO e) ligating said serial arrangement of multiple transcript markers to a vector 

and transforming said host cell with the vector; 

f) performing nucleic acid sequencing analysis on said serial arrangement of 
5' transcript markers and identifying transcript markers; 

g) preparing a first and a second PCR primer, wherein the first PCR primer 
15 is designed from a transcript marker identified in step f) and the second PCR primer is 

designed from a section of nucleic acid common to all cDNAs in the cDNA library; 

h) subjecting the cDNA library to a polymerase chain reaction using the 
primers of step g) thereby identifying the cDNA designated by a transcript marker; and 

i) performing nucleic acid sequencing on the identified DNA. 

20 1 8. The method of Claim 1 7 wherein wherein the polymerase chain reaction of step 

h) is performed in 96 well plates. 

19. The method of Claim 17 optionally comprising the step of amplifying the serial 
arrangement of multiple 5' transcript markers by polymerase chain reaction after step e). 

20. The method of Claim 17 further comprising the step of creating blunt ends on 
25 the 5' transcript markers after step c. 

21. The method of Claim 17 wherein said first restriction enzyme is a Type IIS 
restriction enzyme. 

22. The method of Claim 1 7 wherein the Type IIS restriction enzyme is selected 
from the group consisting of Bpml, Bsgl, Eco57I and BsmFI 

30 23. The method of Claim 17 wherein said cDNA library has been constructed by 

normalization techniques. 

24. The method of Claim 17 wherein said cDNA library has been constructed using 
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random primers. 

25. The method of Claim 17 wherein said cDNA library has been constructed using 
olio dT primers. 

26. The method of Claim 17 wherein the cDNA library has been constructed from 
5 mRNA treated with a demethylating agent. 

27. The method of Claim 26 wherein the demethylating agent is 5- aza-2'- 

deoycytidine. 

28. The method of Claim 17 wherein digesting the cDNA library with the second 
restriction enzyme creates a blunt end. 

10 29. A method for the rapid, sequence-specific identification of cDNAs derived from 

a human mRNA population, comprising the steps of: 

a) obtaining a cDNA library comprising individual cDNAs wherein said 
cDNAs contain first restriction endonuclease sites for an endonuclease having a 4 base pair 
recognition site and wherein said cDNAs are cloned into a first vector lacking the first 

15 restriction endonuclease sites; 

b) subjecting the cDNA library to digestion with the first restriction 
endonuclease thereby creating linearized cDNAs containing a portion of the original 
cDNA; 

c) ligating an adapter to said linearized cDN As wherein the adapter contains 
20 a second restriction endonuclease site for an endonuclease that cleaves within an expected 

range of its recognition site thereby creating cDNAs containing two non-contiguous 
transcript markers joined by the adapter; 

d) digesting the cDNAs of step c) with the second restriction endonuclease 
thereby excising the transcript markers from said cDNAs; 

25 e) concatenating said excised transcript markers; 

f) ligating said concatenated transcript markers to a second vector; 

g) transforming said second vector containing the transcript markers in a 
host cell and culturing said host cells; and 

h) performing nucleic acid sequence analysis on the transcript markers in 

30 said host cells. 

30. The method of Claim 29 optionally comprising concatenating the transcript 
markers after step e) to create a serial arrangement of multiple transcript markers. 
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31 . The method of Claim 29 wherein the cDNA library has been constructed from 
mRNA treated with a demethylating agent. 

32. The method of Claim 29 wherein the demethylating agent is 5-aza-2' 
deoxycytidine. 

5 33. The method of Claim 29 wherein the cDNA library has been constructed by 

normalization techniques. 

34. The method of Claim 29 wherein the second endonuclease is Bcgl. 

35. The method of Claim 29 wherein the nucleic acid sequence analysis comprises 
amplifying transcript markers in an outward manner using 2 specific primers and 

10 sequencing directly using a wobble primer. 

36. The vector as shown in Figure 13. 

37. A method for generating and analyzing non-contiguous transcript markers 
derived from the 5 ! most and 3 1 most end of individual cDNAs, comprising the steps of: 

a) obtaining a cDNA library comprising individual cDNAs having a first 
15 restriction endonuclease site at the 5' most end and at the 3' most end for a restriction 

endonuclease that digests nucleic acid within an expected distance from the first 
endonuclease recognition site, and a second endonuclease restriction site; 

b) subjecting the cDNAs to digestion with the first endonuclease thereby 
creating linearized cDNAs containing transcript markers from the 5' most end and the 3' 

20 most end of the individual cDNAs; 

c) self-ligating said linearized cDNAs thereby joining the transcript markers 
from the 5' most and 3' most ends to create cDNAs containing non-contiguous transcript 
markers; 

d) transforming said linearized cDNAs in a host cell and culturing said host 

25 cells; 

e) isolating said cDNA from said host cells; 

f) digesting the cDNA with the second restriction endonuclease thereby 
excising the non-contiguous transcript markers; 

g) ligating said excised transcript markers to a second vector; 

30 f) transforming said second vector containing the transcript markers in a 

host cell and culturing said host cells; and 

h) performing nucleic acid sequence analysis of the transcript markers. 
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38. The method of Claim 37 farther comprising creating blunt ends on the 5' 

transcript markers after step b. 

39. The method of Claim 37 wherein said first restriction enzyme is a Type IIS 

restriction enzyme. 

5 40. The method of Claim 37 wherein the Type IIS restriction enzyme is selected 

from the group consisting of Bpml, Bsgl, Eco57I and BsmFI. 

41 The method of Claim 37 wherein said cDNA library has been constructed by 

normalization techniques. 

42. The method of Claim 37 wherein said cDNA library has been constructed using 

10 random primers. 

43. The method of Claim 37 wherein said cDNA library has been constructed using 
olio dT primers. 

44. The method of Claim 37 wherein the cDNA library has been constructed from 
mRNA treated with a demethylating agent. 

1 5 45. The method of Claim 44 wherein the demethylating agent is S-aza-2* 

deoxycytidine. 

46. The method of Claim 37 wherein digesting the cDNA library with the second 
restriction enzyme creates a blunt end. 

47. The method of Claim 37 comprising concatenating said 5' transcript markers 
20 after step b) thereby forming a serial arrangement of multiple 5' transcript markers. 

48. The method of Claim 47 comprising amplifying the serial arrangement of 
multiple 5' transcript markers by polymerase chain reaction prior to ligating to a vector. 
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